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Abstract 

Nearly a quarter of genomic sequences and almost half of all receptors that 
are likely to be targets for drug design0 are integral membrane proteins. Un- 
derstanding the detailed mechanisms of the folding of membrane proteins is a 
largely unsolved, key problem in structural biology. Here, we introduce a gen- 
eral model and use computer simulations to study the equilibrium properties 
and the folding kinetics of a Co-based two helix bundle fragment (comprised 
of 66 amino-acids) of Bacteriorhodopsin. Various intermediates are identified 
and their free energy are calculated toghether with the free energy barrier 
between them. In 40% of folding trajectories, the folding rate is considerably 
increased by the presence of non-obligatory intermediates acting as traps. In 
all cases, a substantial portion of the helices is rapidly formed. This initial 
stage is followed by a long period of consolidation of the helices accompa- 
nied by their correct packing within the membrane. Our results provide the 
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framework for understanding the variety of folding pathways of hehcal trans- 
membrane proteins. 
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Considerable effort has been expended to understand the dynamics of the folding and 
biological functionality of proteins. Whereas the behavior of small water soluble globular 
proteins is reasonably well understood both experimentally and theoreticallySi, much less 
is known about membrane proteins (MP)&^^ that cross biological membranes. Transmem- 
brane proteins (TMP) are the most important and best studied class of Mpiii. They are 
characterized by the presence in their primary structure of long segments (20 — 30) of amino 
acids with a high degree of hydrophobicity. In the native structure, these correspond to the 
transmembrane segments which are inserted in the lipidic interior of the membranei. These 
segments are predominantly made up of a-helices and /5-sheets. The stability of a-helices 
and /3-sheets inside the membrane follow from the formation of hydrogen bonds between the 
backbone atoms - other possibilities are excluded within the apolar enviromentB0. 

Phenomenological models have proved to be powerful for interpreting experimental data. 
The most common of these is the Two-Stage model based on experimental evidence that 
the folding of TMP occurs in two stages. In the first stage, a-helices and /3-sheets are 
formed with the full native state structure being formed in a distinct second stag M A 
more refined modeli takes into account four main steps: partitioning, folding, insertion and 
association. Recently, Pappu et al.0 have used a potential smoothing algorithm to predict 
transmembrane helix packing in good accord with experimental data. 

Milik and Skolnickiii have carried out careful Monte Carlo studies of the insertion 
of peptide chains into lipid membranes and have proposed a new hydropathy scale based 
on experimental data obtained by studying the interactions of tripeptides with phospho- 
lipid membraneS and the self-solvation effect in protein system^!. Recently Wimley and 
WhiteS have designed transmembrane peptides that spontaneously insert across bilayers but 
yet have measurable monomeric water stability, opening the way for the determination of 
the thermodynamic cost of partitioning hydrogen bonded peptide bonds into the membrane 
hydrocarbon core. 

The Monte-Carlo results of Milik and Skolnickiii are in good accord with Engelman and 
Steitz's helical hairpin hypothesisS further extended by Jacobs and WhiteElJii. The unfolded 



chain is first adsorbed onto the membrane interface, driven mostly by the hydrophobic 
effect and electrostatic lipid-protein interactions&ll. A polypeptide chain has a greater 
possibility, while anchored to the interface, of saturating its internal hydrogen bonds and 
forming helices. Such helical fragments have a greater propensity to subsequently diffuse 
into the lipid phase. 

A detailed study of TMP has not yet been possible because little is known about the inter- 
actions between amino acids inside the membrane or between them and the lipid molecules. 
Here, we adopt a simple, yet powerful, strategy for attacking the folding properties of TMP 
that circumvents this shortcoming. Our novel approach is based on extensive studies of 
the folding of globular proteins which have underscored the important role played by the 
topology of the native state in controlling both the functionality and the main features of 
the folding process. Nature uses a rich repertory of twenty kinds of amino acids with some- 
times major and at other times subtle differences in their interactions with the solvent and 
with each other in order to design sequences that fit the putative native state with minimal 
frustrationii. Thus a fruitful and general strategy for the study of protein folding would be 
to extract information on the folding process directly from the topology of the native state. 

Our study here focuses on the folding process by using a tractable approach (described 
in the Methods Section) that by-passes the details of the complex interactions of the protein 
in the lipid enviroment by introducing effective potentials, induced by the presence of the 
membrane and the associated interface region, that stabilize the native state structure. The 
validity of the approach based on the native state topology, in the case of globular pro- 
teins, has been confirmed a posteriori from the agreement between theory and experimental 
findingsi&^. The approach proposed here is similar in spirit and ought to be a tool and a 
guide for the difficult experimental situation of TMP0. Our model allows a complete char- 
acterization of the thermodynamics and the dynamics of the full folding process. Due to the 
small number of degrees of freedom involved, the dynamics of the system can be simulated 
for the full folding process. Moreover, the free energies of the most relevant intermediate 
states and free energy profiles along the reaction paths connecting them can be explicitly 



calculated by thermodynamic integration (see Methods). Thus the model is able to quanti- 
tatively discriminate between the possible reaction paths envisaged for the insertion process 
of TMP across the membrane@, a feature that is not an obvious consequence of the structure 
of the model. Where there is overlap, our model captures the qualitative features of the 
earlier simulations of Milik and Skolnickili. 

The TMP we considered is made up of the first 66 amino acids of bacteriorhodopsin 
consisting of two a-helices (Fig. la). It has been shown that the first two helices of bacteri- 
orhodopsin can be considered as independent folding domainJ^. Furthermore, the side-by- 
side interactions between transmembrane helices play a key role in the stabilization of the 
protein structured. 

Our studies were carried out using a Monte Carlo algorithm that has been proven to be ex- 
tremely efficient for interacting hetero-polymers (Methods). The behaviour of the structural 
similarity between the system equilibrated at temperature T (measured in dimensionless 
units) and the native state is shown in Figure 16 in terms of the average fraction of native 
state contacts as a function of T and partitioned depending on their positions with respect 
to the membrane. The three curves correspond respectively to the average fraction of native 
contacts inside (g^) , outside (g^) and across (g^) the membrane (see Methods). All these 
curves, well separated at high T, collapse for T below the transition temperature Tc ~ 0.6, 
indicating a cooperative effect in the folding. On monitoring the free energy as a function 
of the energy around Tc, one observes additional local minima (besides those corresponding 
to the unfolded and folded states) suggesting the presence of an intermediate. 

The intermediate is characterized by having the two helices almost completely formed 
but not yet correctly inserted across the membrane. A metastable state in which the protein 
exists at the membrane interface ought to be expected on general grounds. Indeed a generic 
heteropolymer with hydrophobic and hydrophilic aminoacids, of which a TMP is a particular 
case, has a favorable conformation which is localized near surfaces between two selective 
media (the outside and the inside part of the membrane in the present case)ii'ii. At not too 
high temperatures, the gain in energy to place hydrophobic/hydrophilic protein segments in 



their preferred enviroment compensates the entropy loss for being locahzed at the interface 
with respect to remain in the bulk phase. Thus, even though our model does not explicitly 
contain information on the character of the amino acids, it is able to predict this feature. 

The presence of these extra minima suggests that non-constitutive membrane proteins 
would fold with multi-state kinetics corresponding to on-pathway intermediates. To establish 
their nature of and their influence on the dominant folding pathways, we have performed 
a detailed analysis of the folding kinetics. Each independent kinetic folding simulation was 
started with the equilibrated denaturated state at T* = 2.5 . The protein is placed initially 
outside the membrane in the interface regioni, at a distance comparable to the average 
size of the denatured protein and then suddenly quenched to a temperature (T = 0.4) well 
below the transition temperature. This case simulates the folding kinetics of non-constitutive 
membrane proteins, i.e. proteins that do not need a translocon providing a 'tunnel' through 
which the protein is injected into the lipid bilayer. Folding to the native state occurs mainly 
through the states depicted in Figure 2a with the dominant pathways shown in Figure 2b. 

In all the pathways, the system goes from the unfolded state, U to state HI in which 80% 
of the secondary structure is formed (see q in Figure 3c) and disposed horizontally along the 
interface. The free energy of this state (measured with respect to the free energy of the fully 
folded state) is ~ 2.4 Tc- This state corresponds to the formation of around 70 % of the 
membrane contacts. The average time t^i to reach state HI is of the order of 500 Monte 
Carlo steps (see Figures 3 and 4; each Monte-Carlo step corresponds to 50000 attempted 
local deformations.). State HI turns out to be an obligatory on-pathway intermediate of the 
folding kinetics for non-constitutive MP in agreement with the general argument mentioned 
above. Once the protein reaches state HI, it undergoes a relatively slow process of self- 
arrangement in order to insert and assemble the secondary structures across the membrane. 
This process is the rate-limiting step of the folding process, since it involves the translocation, 
through the lipidic layer, of a substantial number of hydrophilic residues. Among the possible 
pathways, starting from HI, the most frequent (60% of the cases) and the fastest turn out 
to he U ^ HI HV ^ N. A quantitative characterization of this dominant pathway 
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is presented in Figures 3 (for a single folding process) and 4 (as an average over 40 folding 
processes). The intermediate HV is characterized by having one a helix inserted across the 
membrane and is reached in an average period corresponding to a significant fraction of the 
total folding time (see Figure 3). The free energy in this state is ~ 0.98 Tc- The free energy 
barrier between HI and HV is at ~ 4.31 Tc (hence, the rate constant of the transition 
HI — > HV is proportional to kni^HV = exp (— (4.31 — 2.4) Tc/T)). The full free energy 
profile versus a reaction coordinate is shown in Fig. 5. The last part of the folding process 
corresponds to the insertion of the second helix and the assembly of the two secondary 
structures into the native state structure. This process lasts approximately one third of the 
folding time along the pathway U — > HI — > HV — > N. The quasistatic free energy barrier 
between HV and the folded state is ~ 1.66 T(7. The rate costant of the transition HV — > N 
is, therefore, proportional to exp (— (1.66 — 0.98) Tc/T). These results are consistent with 
the time scales observed in the unconstrained folding dynamics. At the end, the protein 
is completely packed, (g^ saturates to 1 (Figures 3a and 4a)) and the helices are correctly 
positioned across the membrane (note the second jump in the z coordinate of the center of 
mass in Figures 36 and 46). 

Much slower dynamics can occur when non-obligatory intermediates are visited by the 
system. These long lived states ({/} in Figure 2a) involve a distribution of misfolded regions 
that trap the system and are characterized by having most of the inter-helical contacts 
formed (assembly of the secondary structures) but with the two a-helices still incorrectly 
positioned. Note, for example, that in states {/} , only transmembrane contacts and some 
contacts outside the membrane are misplaced and they account for only a small fraction 
of the native state energy. For this reason, in the states {/}, the free energy is ~ 1.44 
Tc, only slightly higher than the free energy of HV . The folding can proceed from {/} 
either by disentangling the two helices and passing through the obligatory intermediate 
HV, or by the simultaneous translocation through the membrane of the two helices. These 
processes, however, entail the crossing of a big free energy barrier (~ 5.18 Tc for the first 
process and 6.1 for the second) and happen with low probability. Indeed, at sufficiently 

7 



low temperatures, the loss in energy of the interhelical contacts is not compensated by the 
gain in the configurational entropy due to the uncoupling of the a— helices. Thus below the 
folding temperature, I-states act as trapping regions for the system and when trapped, the 
protein spends most of the time during folding in this state. 

In summary, we have presented detailed calculations of helical transmembrane proteins 
leading to a vivid picture of the folding process. Our strategy relies on the dominant role 
played by the topology of the native state structure and by the effective geometry imposed 
by the membrane and provides a picture which would be expected to be quite accurate for 
well-designed sequences that are a good fit to the target native state conformation. It is 
interesting to note that , with our choice of the parameters, the pathway in which the helices 
assemble outside the membrane and are inserted later is unlikely to occur. 

Models based on the topology of the native state structure have been remarkably 
successfulil^S in correctly describing the main features of the folding process determined 
in experiment si'0@i2F0 for various globular proteins. A similar approach has been gen- 
eralized here to the almost virgin field of transmembrane proteins where experiments are 
rather difficultElSB. Our findings do not depend on the precise values of the e parameters 
introduced in the model underscoring the robustness of the results. Our approach predicts a 
folding process involving multiple pathways with a dominant folding channel. The simpliciy 
of our model allows for a quantitative description of all the pathways since we can monitor 
the correct /uncorrect formation of native contacts and compute free energy profiles. Further 
details not captured by the present approach arising from amino-acid specific interactions 
among themselves, with the solvent and in particular with the interior of the membrane 
may of course change the quantitative nature of the results. However, our model, which 
captures the bare essentials of a membrane protein, ought to provide a zeroth order picture 
of the folding process. Also, as experimental data becomes available, the results could be 
benchmarked with models of this type to glean the other factors that matter. 
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I. METHODS 



We represent the residues of the membrane protein as single beads centered in their 
Ca positions. Adjacent beads are tethered together into a polymer chain by a harmonic 
potential with the average Ca — Ca distance along the chain equal to 3.8A. The membrane 
is described simply by a slab of width w = Zmax — -^min = 26A. Two non-bonded residues 
form a contact if their distance is less then 6.5A. In the study of globular proteins, 
the topology of the native state is encoded in the contact map giving the pairs of 
non-bonded residues that are in contact. Here, in addition, the locations of such pairs 
with respect to the membrane becomes crucial. The contacts are divided into three classes: 
membrane contacts where both i and j residues are inside the membrane, interface contacts 
with i and j in the interface region! outside the membrane and surface contacts with one 
residue inside the membrane and the other outside. Thus a given protein conformation can 
have a native contact but improperly placed with respect to the membrane {misplaced native 
contact). The crucial interaction potential between non-bonded residues is taken to be 
a modified Lennard- Jones 12-10 potential: 
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The matrices and ri{i,j) encode the topology of the TMP in the following way: if 

{i,j) is not a contact in the native state = 0,ri(i,j) = 1; if is a contact in the 

native state but not at the proper location (i.e. a misplaced contact) T{i,j) = ei, Ti{i,j) = 0; 
if (i, j) is a native state contact in the proper region T{i,j) = e, Ti{i,j) = 0. This model 
is intended to describe the folding process in the interface and in the membrane region. 
Our interaction potential (similar in spirit to a well known model0 for globular proteins 
(see also other approaches that model helix formationiHl)) assigns two values to the energy 
associated with the formation of a native contact, e and ei. 

The model captures the tendency to form native contacts. In addition, in order to 
account for the effective interactions between the membrane and the protein, the model 
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assigns a lower energy, — e, to the contact which occurs in the same region as in the native 
state structure compared to — ei when the contact is formed but in the wrong region of 
space. This feature proves to be crucial in determining the mechanism of insertion of the 
protein across the membrane in order to place all native contacts in the same regions as in 
the native state. Even though the interaction potential is simple and intuitively appealing, 
it is not possible to simply guess (without detailed calculations) the folding mechanism and 
quantitatively determine the probability of occurrence of the various folding pathwaysi. 

When e = ei, the protein does not recognize the presence of the interface- membrane 
region and the full rotational symmetry is restored (the system behaves like a globular 
protein). The difference in the parameters (e — ei) controls the amount of tertiary structure 
formation outside the membrane. When the difference is small, the protein assembles almost 
completely outside the membrane and the insertion process would be diffusion limited. Our 
results are independent of the precise values of the energy parameters e and ei (e > ei) as 
long as they are not too close to each other. 

We report here the results of simulations with ei = 0.1 and e = 1. r^j and dij are 
the distance between the two residues and their distance in the native configuration, 
respectively. In order to account for the chirality of the TMP, a potential for the pseudodi- 
hedral angle ai between the Ca atoms in a helix corresponding to four successive locations 
is added which biases the helices to be in their native state structure. 

The thermodynamics and the kinetics of the model were studied by a Monte Carlo 
method for polymer chains allowing for local deformations. The efficiency of the program 
(usually low for continuum calculations) has been increased by full use of the link cell 
technique^ and by the multiple Markov chain method, a new sampling scheme, which has 
been proven to be particulary efficient in exploring the low temperature phase diagram for 
polymers^. In our simulation 20 different temperatures ranging from T = 2 to T = 0.17 
have been studied. The free energy is calculated by reweighting the different temperatures 
with the Ferrenberg-Swendsen@ algorithm. 

The free energy difference J-'b — Ta between two states A and B has been estimated as 
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the reversible work that has to be done in order to go from A to B. Hence, denoting by 
X (A) a reaction coordinate connecting A and B (for A = and A = 1 the system is in A 
and B respectively), and by (•)^ = (5 (x — x (A)) • ), the canonical average at fixed reaction 
coordinate, 

- = i; ,X (F), . . S (F), . ^1 (A.,. - A.) (2) 

I A- 2 

where F is the force and {Aj, i = 1, . . .} is a suitably dense partition of the interval (0, 1). 
The average value (F);^, at each A^ is computed by a long (more than 5000 steps) Monte 
Carlo run performed with dynamics satisfying the constraint x = x (A^) . The free energy 
differences obtained with this method are accurate to within ~ 0.1 7^ for the various states 
whereas the free energy barriers are accurate within ~ 0.5 Tf; . This error takes into account 
possible hysteresis effects due to the finite simulation time. 
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II. FIGURE LEGENDS 



Figure 1: Structure and thermodynamics of tlie lielical transmembrane protein, 
a) Ribbon representation of the two-helix fragment of bacteriorhodopsin formed 
by the first 66 amino-acids. The part inside the membrane (determined 
by using the neural network learning algorithm available at http:/ /www.embT^ 
heidelberg.de / Services / sander / predictprotein /|) is shown in red, the part above (below) the 



membrane in blue (green), b) Average equilibrium fraction of native contacts outside, 
(o), inside, (n), and across, (A), the membrane as a function of the temperature T. 
All these quantities are expressed in energy unit of e (see Methods). The folding transition 
temperature Tc when all the curves cross the value 1/2 is around 0.6. This value is in accord 
with the temperature of the heat capacity maxima. 

Figure 2: Schematic representation of states encountered by non- constitutive proteins 
during the folding process. 

In a) the red cylinders denote a-helices that reside within the membrane in the native 
state. The region inside the membrane is in turquoise whereas the rest represents the 
interface region^ in which the folding process starts. State U denotes the denatured state 
of the protein, HO is a state in which the helices have been formed but are not yet inside 
the membrane whereas HI corresponds to a similar state but with the helices completely 
embedded in the membrane without any inter-helical contacts. Usually the helices form and 
enter into the membrane separately. HV denotes an obligatory intermediate and N depicts 
the native state. The state { J} represents an ensemble of long lived conformations in which 
helices are formed inside the membrane with several inter-helical contacts, but with the two 
a-helices still incorrectly positioned. This conformations differ in term of packing efficiency 
of the helices. The state {/} is not obligatory for the folding kinetics. In b) the schematic 
pathways to the native state are shown. In the most directed path, the entropy decreases 
on going from U to N. From HI to HV the entropy loss of one helix is not compensated by 
a corresponding energy gain until both helices become vertical. This is the principal origin 
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of the high free energy barrier between the state HI and the native state. 

Figure 3: Typical time dependence of different parameters as a function of the Monte- 
Carlo steps for the pathway U — > HI — > HV —>■ N. Fraction of native contacts inside the 
membrane (a) , normalized z-coordinate of the center of mass of the protein (with respect to 
that of the native state conformation) (b) and overall fraction of native hehcal contacts (c). 
Each Monte-Carlo step corresponds to 50000 attempted local deformations. The transition 
from state HI to state HV is signalled by a sharp jump of the position of the center of 
mass. Note that there is no perceptible sign of this transition in terms of newly formed 
native contacts. Most of the helical contacts are formed in the early stages of the folding. 
This fraction does not significantly increase until helices correctly assemble and the inter- 
hehcal contacts are formed. The HV — > N transition is reached by a progressive zippering 
of the horizontal and vertical helices. This zippering is usually very quick (few MC steps) 
and is only slightly slowed down (see the plateau corresponding to qm ~ 0.9 in a) when the 
trajectory passes through somewhat deformed conformations, (d) Protein conformations at 
different times during the folding. The colours red, green and blue have the same significance 
as in Figure la with the grey bonds being ones crossing the membrane. 

Figure 4: Distribution of the fraction of native contacts inside the membrane 
(a) and of the normalized 2;-coordinate of the center of mass {Rz = ^^) (b). The data were 
obtained using 40 independent kinetic simulations with pathway U — > HI — > HV — > N. 
The grey scale distribution indicates the probabilities at given times: darker points denote 
higher probability. 

Figure 5: Free energy profiles along three reaction coordinates at T=0.85Tc. The 
continuous lines are sphne fits to the free energy data (crosses) . To obtain free energy 
differences between two states we estimate the reversible work that has be done to go from 
one state to the other. For this purpose, we fix the z coordinate of a specific residue in order 
to compute the canonical average of the force and then apply eq. (2) (See Methods). The 
free energy of the native state is defined to be equal to 0. (a) Free energy as a function of 
the 2;-coordinate of the 58-th residue {z — corresponds to the middle of the membrane) 
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starting from HV; this forces the second hehx to cross the membrane as the protein goes 
from HV to N; the local minimum at z~20 corresponds to a state topologically equivalent 
to HV, with the helix containing the 58-th residue fully formed on the membrane interface 
but without any contact with the first helix (in HV some of the inter-helices contacts are 
already formed); (b) the 5-th residue is translocated across the membrane with the protein 
starting from state HI and proceeding to HV; (c) the same as in (b) , but the initial state is 
I (see Fig. 2-a) 
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FIG. 4. Seno 
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